knitr::opts_chunk$set(echo=TRUE, warning=FALSE,
                      message=FALSE, eval=TRUE, cache=TRUE,
                      fig.width=20, fig.height=20)
library(ncdf4)
## Helper function: smooth x by taking means over n evenly-spaced
## windows of width n.
windowsmooth <- function(x, n){
  ii = seq.int(from = 0, to = length(x) - n, length.out = n)
  jj = ii + n
  inds = Map(function(a, b) (a + 1):b, ii, jj)
  sapply(inds, function(ind) mean(x[ind]))
}

Model setup

(Heavily borrowed from Paul’s Bayesian workshop slides on matrix population modeling.)

In cell-level (cytometry) data collected over time from the ocean, one interesting derived dataset is the empirical distribution of biomass (or counts) over time. Using scientific intuition about the cells’ interaction with the environment, we can model the evolution of the empirical biomass distribution over time. This lends valuable empirical evidence to our theoretical understanding of cells’ life cycles in the ocean.

We describe a matrix population model.

First, for simplicity, assume the empirical biomass distribution is discrete, with \(m\) size classes:

  • \(m\): number of size classes
  • \(v_{\min}\): minimum cell diameter.
  • \(\Delta_{v}\): determines the size class spacing, where \(\Delta_{v}^{-1}\) must be an integer.

The \(i\)’th size class has a diameter between \(v_i\) and \(v_{i+1}\), which are defined as:

\[ v_{i}=v_{\min } 2^{(i-1) \Delta_{v}} \text { for } i=1, \ldots, m+1.\]
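For concreteness, the boundary grid can be computed directly; the values of \(m\), \(v_{\min}\), and \(\Delta_v\) below are illustrative, not taken from the data:

```r
## Illustrative (made-up) values: m size classes, minimum diameter
## v_min, and spacing parameter delta_v with integer inverse.
m <- 26
v_min <- 0.5
delta_v_inv <- 6
delta_v <- 1 / delta_v_inv

## Boundaries v_1, ..., v_{m+1}; class i spans diameters [v_i, v_{i+1}).
v <- v_min * 2^(((1:(m + 1)) - 1) * delta_v)
head(v)
```

Note that consecutive boundaries differ by a factor of \(2^{\Delta_v}\), so a cell doubles in diameter after crossing \(\Delta_v^{-1}\) classes.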

The empirical distribution at time \(t\) is:

\[w(t) \in \mathbb{R}^{m},\]

whose \(i\)’th entry (\(i=1,\cdots,m\)) is the normalized count (or amount of biomass) in the \(i\)’th size class; normalization is done so that \(\sum_{i=1}^{m} w_{i}(t)=1\).

The main mechanism for evolution over time is through the time propagation matrix:

\[A\left(t, \Delta_{t}\right) \in \mathbb{R}^{m \times m}\]

defined as the transition of the distribution from time \(t\) to a later time \(t+\Delta_t\). This matrix directly encodes three basic types of cell activities.

  1. Growth: The subdiagonal entries (in blue) represent a transition to the next larger size class.

  2. Division: The entries (in orange) \(j\) cells away from the diagonal represent division: biomass/counts transition from the size class \(j\) above the diagonal to the diagonal, representing an exact halving of the cell size. (The value of \(j\) is determined by the diameters of the size classes.)

  3. Stasis: The diagonal entries; the remaining cells stay in their size class.

Using this evolution mechanism, the size distribution at the next time is defined as:

\[ w\left(t+\Delta_{t}\right)=\frac{A\left(t, \Delta_{t}\right) w(t)}{\left\|A\left(t, \Delta_{t}\right) w(t)\right\|_{1}},\]

which is the normalized size distribution after having evolved by a left multiplication by \(A(t, \Delta_t)\).
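As a concrete illustration, a toy \(A\) can be assembled with stasis on the diagonal, growth on the subdiagonal, and division entries \(j\) cells away from the diagonal, then applied to \(w(t)\) with the normalization above. All rates and sizes here are made up for the sketch, not fitted values:

```r
## Toy propagation matrix with made-up rates.
m     <- 6      ## number of size classes
gamma <- 0.3    ## growth rate (move to next larger class)
delta <- 0.2    ## division rate (halving of cell size)
j     <- 3      ## offset such that class i+j has twice the diameter of class i

A <- diag(1 - gamma - delta, m)            ## stasis on the diagonal
for(i in 1:(m - 1)) A[i + 1, i] <- gamma   ## growth: class i -> i+1
for(i in 1:(m - j)) A[i, i + j] <- delta   ## division: class i+j -> i
A[m, m] <- 1 - delta                       ## largest class cannot grow further

## Evolve a size distribution one step and renormalize.
w <- rep(1/m, m)
w_next <- as.vector(A %*% w)
w_next <- w_next / sum(w_next)
```

The columns of \(A\) need not sum exactly to one, since the evolved distribution is renormalized at every step.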

Model details

Growth function

Growth is driven directly by sunlight \(E(t)\):

\[\gamma\left(t, \Delta_{t}\right)=\Delta_{t} \gamma_{\max }\left(1-\exp \left(-\frac{E(t)}{E^{*}}\right)\right)\]

TODO: be more detailed about this e.g. define the variables, explain this effect in words.

TODO: will growth be size dependent? Ask the group’s opinion.
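As a sketch, the growth function translates directly to R; the parameter values below are made up for illustration:

```r
## Growth fraction per time step, driven by light E(t).
## gamma_max and E_star are free parameters (values here are made up).
growth_frac <- function(E, delta_t, gamma_max, E_star){
  delta_t * gamma_max * (1 - exp(-E / E_star))
}

## Growth is zero in darkness and saturates at gamma_max * delta_t
## as light increases.
E <- c(0, 10, 100, 1000)
growth_frac(E, delta_t = 1, gamma_max = 0.5, E_star = 100)
```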

Division

Possibly size-dependent, parametric or non-parametric. Splines are one way to go.

Respiration

One major missing component of the model described thus far is cell respiration, another major known cell activity that directly affects the cell size distribution. There were several ideas:

  1. A super-diagonal set of entries, or
  2. Growth rate allowed to be negative.

Datasets

Zinser data

Basically, two days’ worth of 2-hourly size distribution data collected in a controlled lab environment.

TODO: describe more rigorously here.

##' Helper to open Zinser data.
##' @param filename NC file filename, like
##'   \code{"Zinser_SizeDist_calibrated-52-12.nc"}.
##' @param datadir Directory containing the NC file.
##' @return List containing data \code{obs} (T x m matrix), \code{time} (T
##'   minutes passed since the beginning), \code{num_sizeclass}
##'   (number of size classes, m), \code{v} (left-hand-side borders of the
##'   size classes), \code{nc} (the open NC handle), and \code{PAR} (sunlight).
open_zinser <- function(filename = "Zinser_SizeDist_calibrated-52-12.nc",
                        datadir = "data"){

  ## Read data.
  ## "data/Zinser_SizeDist_calibrated-52-12.nc"
  nc <- nc_open(file.path(datadir, filename))
  PAR   <- ncvar_get(nc,'PAR') ## Sunlight
  w_obs <- ncvar_get(nc,'w_obs') ## Data

  ## Number of size classes.
  m <- ncvar_get(nc,'m')
  
  ## Time (in number of minutes since the beginning)
  time  <- ncvar_get(nc,'time') 
  
  ## Size classes
  delta_v_inv <- ncvar_get(nc,'delta_v_inv')
  v_min       <- ncvar_get(nc,'v_min')
  delta_v <- 1/delta_v_inv
  v <- v_min * 2^(((1:m) - 1) * delta_v)

  ## Label the rows and columns of w_obs by time and size class.
  rownames(w_obs) = time
  colnames(w_obs) = v

  ## Return
  return(list(obs = w_obs,
              time = time,
              num_sizeclass = m,
              v = v,
              nc = nc,
              PAR = PAR))
}

We can see that growth and division are in concordance with the sunlight variable PAR over the two-day period of the data.

The bottom plot shows another version of the data with a different number of size classes: half as many size categories (26 of them), and the same number of time points (\(25\)).

## Retrieve data
for(filename in c("Zinser_SizeDist_calibrated-52-12.nc",
                  "Zinser_SizeDist_calibrated-26-6.nc")){

  ## Load data
  dat = open_zinser(filename)
  
  ## Plot *calibrated* Zinser data.
  ncat = ncol(dat$obs) ## alternatively, ncat = dat$num_sizeclass
  cols = colorRamps::blue2red(ncat)
  ylim = range(dat$obs)
  matplot(dat$obs, col = cols, lwd = seq(from = 0.1, to = 5, length = ncat),
          lty = 1, type = 'l', ylim = ylim, ylab = "", xlab = "")
  title(main = paste0("Zinser data, ", ncat, " size classes"))


  ## Mark every d'th time point as held-out test data.
  d = 3
  i_test = rep(0, length(dat$time))
  i_test[seq(from=d, to=length(dat$time), by=d)] = 1

  ## Add grey regions for the time points that we're going to leave out.
  cex = 1
  abline(v = which(i_test == 1), col='grey', lwd=3, lty=1)
  for(ii in which(i_test == 1)){
    points(x=rep(ii, dat$num_sizeclass), y=dat$obs[ii,], pch=16, col="grey", cex=cex)
    points(x=rep(ii, dat$num_sizeclass), y=dat$obs[ii,], pch=16, col="white", cex=cex * 0.7)
    points(x=rep(ii, dat$num_sizeclass), y=dat$obs[ii,], pch=1, col="grey", cex=cex)
  }
  
  ## Add sunlight
  scpar = scale(dat$PAR)
  scpar = scpar - min(scpar)
  scpar = scpar / max(scpar) * max(dat$obs)
  lines(scpar, lwd = 5)

  ## Add legend
  legend("topright", lwd=3, col='grey', pch=1, legend="Test data")
}

(Lagrangian) Seaflow data

Collected from the ocean, following a parcel of water. Much more fine-grained in time (TODO: maybe we can utilize this?).

TODO: describe more rigorously here.

We can make a visualization of the SeaFlow data, from the file data/SeaFlow_SizeDist_regrid-15-5.nc in the GitHub repository https://github.com/fribalet/Bayesian-matrixmodel .

## Setup
ndays            <- 4.0
limit_to_numdays <- 4.0
stride_t_obs     <- 10
data             <- list()
data$dt          <- 10
source(file.path('data_processing.r'))

## Visualize the data
ncat = nrow(data$obs)
cols = colorRamps::blue2red(ncat)
ylim = range(data$obs)

## Make two plots:
par(mfrow=c(2,1))
smoothed_par = windowsmooth(data$PAR, ncol(data$obs)) ## Avoid masking graphics::par()

## Smaller sizes
iis = 1:7
matplot(t(data$obs), type='l', lwd=2, lty=1, ylim=ylim, col='grey80',
        ylab="")
lines(smoothed_par/max(smoothed_par) * 0.2, lwd=3, lty=3)
cutoff = quantile(smoothed_par/max(smoothed_par), 0.2)
lines(pmin(smoothed_par/max(smoothed_par) * 0.2, cutoff), lwd=3, lty=1)
matlines(t(data$obs)[,iis], type='l', lwd=2, col=cols[iis], lty=1, ylim=ylim)
title(main="Smaller sizes")

## Larger sizes
iis = 8:15
matplot(t(data$obs), type='l', lwd=2, col='grey80', lty=1, ylim=ylim,
        main="Larger sizes", ylab="")
lines(smoothed_par/max(smoothed_par) * 0.2, lwd=3, lty=3)
cutoff = quantile(smoothed_par/max(smoothed_par), 0.2)
lines(pmin(smoothed_par/max(smoothed_par) * 0.2, cutoff), lwd=5, lty=1)
matlines(t(data$obs)[,iis], type='l', lwd=2, col=cols[iis], lty=1, ylim=ylim)

We can see that:

  1. The small size distributions negatively correlate with the large size distributions,

  2. Their diel cycles are clearly visible,

  3. The small size distributions increase overnight (because division is prominent), and

  4. The large size distributions increase with a slight lag to the sunlight cycle.

Comparison of the two datasets

  • TODO: Qualitative comparison of the two datasets. For instance, the Zinser data is experimental, so the division rate can be directly calculated (assuming no cell growth) as the derivative of the summed size distribution.
  • TODO: Pro/con comparison?

Literature

  • TODO: Briefly summarize Sosik model.
  • TODO: Find more applications of matrix models elsewhere.

Model fitting

  • TODO: Describe the actual iterations (e.g. in 20-min intervals, one hour at a time)
  • TODO: Describe the Bayesian fitting and setup (Basically, I need to study it more)
  • TODO: Write another short section about model evaluation (training/test split).

Bayesian model setup

At the minimum, we should set up the framework of:

  1. Prior: The various “parameters” of the matrix model include:
     • \(E^*\): parameter in the growth function.
     • \(\sigma\): noise of the data around the mean.
  2. Data likelihood: The data are the observed size proportions \(\{w(t) \in \mathbb{R}^m\}_{t=1,\cdots,T}\). These measurements lie on a simplex, i.e. \(\sum_{i=1}^m w(t)[i] = 1\) for all \(t=1, \cdots, T\).

We should look into whether we can model \(w(t)\) directly as compositions. I don’t think this will be straightforward, or easy to write simply as a likelihood model.

In the meantime, we need a way to link the observed size proportions \(w(t)\) to our estimates \(\hat w(t)\) (whose posterior samples we observe). One way is to perform a log-ratio transform with respect to one category; say that that one category is the last size class. Then the log-ratio transform is

\[l_t[i] = \log\left(\frac{w(t)[i]}{w(t)[m]}\right), \quad i=1,\cdots, m-1.\]

This is a mapping from the simplex in \(\mathbb{R}^m\) to \(\mathbb{R}^{m-1}\), where the mapped values have no geometric constraint.
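A minimal sketch of this (additive) log-ratio transform and its inverse in R, taking the last size class as the reference category:

```r
## Additive log-ratio (alr) transform w.r.t. the last category,
## mapping an m-part composition to an unconstrained (m-1)-vector.
alr <- function(w) log(w[-length(w)] / w[length(w)])

## Inverse: recover the composition from the transformed values.
alr_inv <- function(l){
  e <- c(exp(l), 1)
  e / sum(e)
}

w <- c(0.2, 0.5, 0.3)
l <- alr(w)            ## length-2, unconstrained
w_back <- alr_inv(l)   ## recovers w exactly
```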

There is no obvious utility in terms of restoring normality, symmetry, or bell-shapedness; in fact, there is not much evidence that the resulting posterior distributions of the individual \(w(t)[i]\)’s are pathologically non-bell-shaped (are they?). In an artificial simulation of \(T=1000\) Poisson-distributed 3-variate counts transformed to compositions, we can see that the distributions are not pathological.

## Generate a bunch of 3-length probabilities, from data
lambda = 100
n = 3
N = 1000
pmat = t(sapply(1:N, function(ii){
  X = rpois(n, lambda)
  p = X/sum(X)
}))

par(mfrow=c(3,3))
for(phi in seq(from=10, to=180, length=9)){
  par(mar=c(1,1,1,1))
  plot3D::scatter3D(pmat[,1], pmat[,2], pmat[,3],
                    col = rgb(0,0,1,0.2),
                    pch = 16,
                    bty='g', phi=phi,
                    ticktype = "detailed") ## Using detailed ticks
}

## Do a *centered* log-ratio (clr) transform, keeping the first two coordinates
pmat_transformed = t(apply(pmat, 1, function(myrow){
  gm = prod(myrow)^(1/3) ## Geometric mean
  log(myrow[1:2]/gm)
}))
## Before transformation
library(ggplot2)
df = data.frame(x=pmat[,2], y = pmat[,3])
p <- ggplot2::ggplot(df, aes(x, y)) + geom_point() + theme_classic() + ggtitle("Before transformation")
ggExtra::ggMarginal(p, type = "histogram") 

## MVN::mvn(pmat, multivariatePlot= "qq")
## title(main="2d Normal QQ plot")
## After transformation
df = data.frame(x=pmat_transformed[,1], y = pmat_transformed[,2])
p <- ggplot2::ggplot(df, aes(x, y)) + geom_point() + theme_classic()+ ggtitle("After log ratio transformation")
ggExtra::ggMarginal(p, type = "histogram")

## MVN::mvn(pmat_transformed, multivariatePlot= "qq")
## title(main="2d Normal QQ plot")

Rather, the problem is that proportions have a constant sum of \(1\), so the Euclidean distance between two proportion vectors in \(\mathbb{R}^m\) can be a misleading (distorted) quantity. For instance, when \(m=2\), all measurements lie on a line, so the full Euclidean error is not sensible; one might just as well omit the second category and measure the estimation error in the first category alone. The log-ratio transform achieves this effect.

Note that if we also observe the total number of particles \(C_t\), so that we know the total counts \(W(t) = C_t\, w(t)\), then we might be able to use a multinomial model.
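As a sketch of this idea, with made-up counts: given observed counts per size class at one time point and a model-predicted size distribution, the multinomial likelihood is direct to evaluate.

```r
## Hypothetical counts per size class at one time point (C_t = 100).
W_t   <- c(30, 50, 20)
## Model's predicted size distribution (made up for illustration).
w_hat <- c(0.25, 0.55, 0.20)

## Multinomial log-likelihood of the observed counts.
ll <- dmultinom(W_t, size = sum(W_t), prob = w_hat, log = TRUE)
```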

Some other leads:

  • https://link.springer.com/article/10.1007/s10260-020-00512-y
  • The arcsine transformation: https://en.wikipedia.org/wiki/Cohen%27s_h
  • Aitchison’s short book on compositions:
    http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf
    http://www.sediment.uni-goettingen.de/staff/tolosana/extra/CoDa.pdf

Currently, the data model (i.e. likelihood model) is based on a sum of errors across size classes:

# From "stan/matrixmodel_sigmoidaldelta.stan"
...
    for(it in 1:nt_obs){
        if(i_test[it] == 1){
            diff = 0.0;
            for(iv in 1:m){
                diff += fabs(mod_obspos[iv,it] - obs[iv,it]);
            }
            diff = diff/sigma;
            log_like_test += normal_lpdf(diff | 0.0, 1.0);
        }
    }
    log_like_test = log_like_test/n_test;

This says that the data likelihood treats the sum, across ALL size classes, of the absolute errors in each class, scaled by \(\sigma\), as a draw from a standard truncated normal:

\[ \frac{1}{\sigma} \sum_{i=1}^m |w(t)[i] - \hat w(t)[i]| \sim \mathrm{truncNorm}(0, 1).\]

While this achieves the basic effect of keeping the total error small, it allows a large error in some individual categories, because only the sum of the absolute errors is penalized.

Why is it bad for us to use the same \(\sigma\) across categories? In other words, do we want different “standards” of estimation accuracy across categories?
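For intuition, the Stan likelihood above can be mirrored in R. This is a sketch, assuming `mod` and `obs` are \(m \times T\) matrices of predicted and observed size distributions; the toy values are made up:

```r
## L1-error likelihood, mirroring the Stan code: the sigma-scaled sum of
## absolute errors at each held-out time point is scored against N(0, 1),
## and the log-likelihoods are averaged over the test points.
loglik_test <- function(mod, obs, i_test, sigma){
  stopifnot(dim(mod) == dim(obs))
  test_cols <- which(i_test == 1)
  diffs <- colSums(abs(mod[, test_cols, drop = FALSE] -
                       obs[, test_cols, drop = FALSE])) / sigma
  mean(dnorm(diffs, 0, 1, log = TRUE))
}

## Toy example: 3 size classes, 4 time points, every other point held out.
obs <- matrix(1/3, nrow = 3, ncol = 4)
mod <- obs + 0.01
loglik_test(mod, obs, i_test = c(0, 1, 0, 1), sigma = 0.05)
```

Note that `dnorm` here plays the role of `normal_lpdf`; since the summed absolute error is non-negative, the density is effectively that of a truncated (half-) normal up to a constant.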

Writing/questions

  1. Matrix model book; can I get an online copy?
  2. Other references: from Caswell’s papers (e.g. https://www.dropbox.com/s/i2axf4ricgzioo7/caswell-2019.pdf?dl=0), there are repeated references to the eigenvalues of the \(A\) matrix being linked to growth. What else is there to learn about this?
  3. Counts or biomass?
  4. It would be great to have an introductory figure showing what the data is, as early as possible in the paper.